Read First

High Level Flow

1) Understand the domain; ask the domain experts questions

2) Acquiring data

3) First glance at the data (can sometimes be skipped)

4) Sample if needed

5) Given a dataframe that fits in memory, we can investigate further

6) Consider missing values and outliers

7) Now we can do some heavy lifting exploration





Detailed Flow

2) Acquiring data

You can acquire the data from the following resources:

If you feel more comfortable with another file format, you can use the following
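For example, a minimal pandas sketch for converting a CSV into other common formats (all file names here are hypothetical):

```python
import pandas as pd

df = pd.read_csv("data.csv")               # hypothetical source file
df.to_parquet("data.parquet")              # requires pyarrow or fastparquet
df.to_json("data.json", orient="records")  # one JSON object per row
```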

3) First glance at the data (can sometimes be skipped)

Once you know the business process and you have the data on your end, you should play around with your data.

For each file format there are different tools.
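For example, the usual first-glance calls in pandas (assuming a CSV; the file name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file

df.info()             # column dtypes and non-null counts (prints directly)
print(df.head())      # first few rows
print(df.describe())  # summary statistics for numeric columns
print(df.shape)       # (rows, columns)
```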

4) Sample if needed: if the data doesn't fit in RAM

Most of the fast, general-purpose tools work in RAM, so we should pick a sample of the data that represents it well.

We use the following:
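One simple approach, sketched here with pandas chunking (file name, chunk size, and sample fraction are all hypothetical, and this is only one of several possible tools):

```python
import pandas as pd

# Stream the file in chunks and keep a small random sample of each chunk,
# so the full dataset never has to fit in RAM at once.
chunks = pd.read_csv("big_data.csv", chunksize=100_000)  # hypothetical file
sample = pd.concat(chunk.sample(frac=0.01, random_state=42) for chunk in chunks)
print(sample.shape)
```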

5) Given a dataframe that fits in memory, we can investigate further

By far the most useful tool I know of is pandas-profiling (https://github.com/JosPolfliet/pandas-profiling). It allows us to do the following (a usage sketch appears after this list):

  • See the dataset's metadata: number of records, number of bytes
  • Get warnings about the data, such as high correlations, missing values, etc.
  • Identify the schema of the data, i.e. the data type and category of each variable
  • Look at means, medians, standard deviations and histograms to understand the distributions
  • Check completeness: are critical data values missing? A database with missing values is not unusual, but when the missing information is critical, completeness becomes an issue.
  • Check conformity: does the data follow standard definitions? For example, are dates in a standard format? Conformity to standard formats is important for keeping structure and nomenclature consistent, both for sharing and for internal data management. Are your data values correct?
  • Note: box plots for continuous variables are still missing
  • Note: consider practical significance; a small effect can sometimes be useful and a large one useless
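A minimal usage sketch, assuming a CSV named data.csv (hypothetical); note that the exact call signatures differ between pandas-profiling versions:

```python
import pandas as pd
import pandas_profiling  # pip install pandas-profiling

df = pd.read_csv("data.csv")  # hypothetical file
profile = pandas_profiling.ProfileReport(df)
profile.to_file(outputfile="report.html")  # newer versions: to_file("report.html")
```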

Another cool tool allows one to create pivot tables on a dataframe: https://github.com/nicolaskruchten/jupyter_pivottablejs. It allows us to (see the sketch after this list):

  • Slice our data
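A quick sketch of how it is typically invoked inside a Jupyter notebook (the file name is hypothetical):

```python
import pandas as pd
from pivottablejs import pivot_ui  # pip install pivottablejs

df = pd.read_csv("data.csv")  # hypothetical file
pivot_ui(df)  # renders an interactive drag-and-drop pivot table in the notebook
```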

Another cool tool allows one to easily do bi-variate analysis on a dataframe: https://github.com/ayush1997/visualize_ML. It allows us to (a hand-rolled sketch follows the list):

  • Perform bi-variate analysis, which finds the relationship between two variables. Here we look for association and disassociation between variables at a pre-defined significance level.
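If you prefer to hand-roll the same idea rather than use the library's own API, a minimal bi-variate check with pandas and scipy might look like this (the column names are hypothetical):

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("data.csv")  # hypothetical file

# Association between two (hypothetical) numeric columns at alpha = 0.05
r, p = stats.pearsonr(df["age"], df["income"])
print(f"Pearson r = {r:.3f}, p-value = {p:.4f}")
if p < 0.05:
    print("Significant association at the 0.05 level")
```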

Another cool tool allows visualization on a dataframe: https://github.com/altair-viz/altair_widgets. It allows us to (see the sketch after this list):

  • Perform the same kind of bi-variate analysis interactively
  • Run small, quick analyses
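A sketch of how it is used; treat the interact_with entry point as an assumption and check the repo's README for the current API:

```python
import pandas as pd
from altair_widgets import interact_with  # pip install altair_widgets

df = pd.read_csv("data.csv")  # hypothetical file
interact_with(df)  # interactive widget for choosing fields and chart types
```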

The following issues need to be taken into consideration in this part as well (a pandas sketch follows the list):

  • Check timeliness: is the data available when expected and needed? Timeliness depends on the user's expectations and needs. This is relevant mainly for sources where the data is acquired frequently.
  • Check consistency: does the data across several systems reflect the same information? If data is reported across multiple systems, it should carry the same information.
  • Check integrity: is the data valid across its relationships, and can all the data in a database be traced and connected? For example, in a customer database there should be a valid customer/sales relationship.
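A small sketch of how completeness and conformity checks might look in plain pandas (the column name and date format are hypothetical):

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file

# Completeness: fraction of missing values per column
print(df.isnull().mean().sort_values(ascending=False))

# Conformity: count rows whose date column fails the expected format
parsed = pd.to_datetime(df["order_date"], format="%Y-%m-%d", errors="coerce")
print(f"{parsed.isnull().sum()} rows with non-conforming dates")
```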

6) Investigate outliers and missing values

Sometimes the missing values hold a pattern in them; we should use the following:
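One common option, sketched here with the missingno library (my suggestion, not necessarily the tool originally intended; the file name is hypothetical):

```python
import pandas as pd
import missingno as msno  # pip install missingno

df = pd.read_csv("data.csv")  # hypothetical file
msno.matrix(df)   # nullity matrix: where the gaps are, row by row
msno.heatmap(df)  # nullity correlations: which columns go missing together
```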


Sometimes the outliers are actually the most interesting part of the data, which makes this a very important step. One can find outliers (univariate and multivariate) using the following:
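For example, a minimal univariate z-score filter plus a multivariate option via scikit-learn's IsolationForest (thresholds and column names are hypothetical):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("data.csv")  # hypothetical file

# Univariate: flag values more than 3 standard deviations from the mean
z = (df["income"] - df["income"].mean()) / df["income"].std()
univariate_outliers = df[z.abs() > 3]

# Multivariate: IsolationForest marks outliers with the label -1
labels = IsolationForest(random_state=42).fit_predict(df[["age", "income"]])
multivariate_outliers = df[labels == -1]
print(len(univariate_outliers), len(multivariate_outliers))
```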

After the outliers have been found, one should explore the data as in step 5.

7) Now we can do some heavy lifting exploration:

Now you are supposed to understand the data and be able to actually answer the domain questions from step 1.





Stuff to check in the future: